Vast Challenge MC2 II

This blog is for the Vast Challenge 2021, Mini Challenge 2 Q3, Q4, and Q5.

Linli Zhong https://github.com/LinliZhong
07-23-2021

Note: This is the solution for Q4 and Q5 of Vast Challenge MC2. To view solution for Q1, Q2 and Q3, please click here(https://linlizhong.netlify.app/posts/2021-07-25-assignment-mc2/)

1. Overview

1.1 Background

GAStech is a natural gas production company in an island country of Kronos in the past twenty years. It builds strong relationships with the government of Kronos but low environment stewardship. In January 2014, when celebrating the success fortune by IPO (initial public offering), several employees of GAStech went missing.

This post is focused on VAST Challenge 2021 Mini-Challenge 2. Mini-Challenge 2 provides useful data of missing employees’ relevant information, a list of vehicle assignments by employee, vehicle tracking data, loyalty card transaction data, credit and debit card transaction data, and a tourist map of Abila. In this post, I will make use of R language to visual and analyze available data and identify the suspicious activity of the employees.

1.2 Literature review

The task of Mini Challenge 2 in 2014 was to find unusual patterns in employees’ daily lives from their credit/debit/loyalty card records and the GPS tracking records of their cars. In this article, I mainly refer to two works by Central South University and KU Leuven.

For Central South University, they used D3, MySQL, Excel as their visualization tools. They first identified daily routines for GAStech employees. They divided these missing employees into general staffs and truck drivers. Then they discuessed these two groups separately. They detected suspicious activities by visualization the credit card and gps data.

For KU Leuven, they had three stages to visualize. (1) Aggregate & Slice, (2) Design, Filter & Analyze and (3) Communicate. They first made full use of R to test their hypothesis. Then involved the design and implementation of streamlined tools to optimize the identification of specific patterns. After that, they presented and visualized the results.

2. Data Preparation

2.1 Install and load packages

The code chunk below is to check required packages are installed or not, if they are not installed, the code chunk will help install them automatically. After all packages are installed, the code chunk will load them.

packages = c('igraph','ggraph','visNetwork','lubridate','dplyr',
             'clock','tidyverse','DataExplorer','gplots','plotly',
             'ggplot2','superheat','tm','plotly','lattice','GDAdata','raster',
             'sf','tmap','foreign','gifski','rgdal','tiff','sp','leaflet',"DT"
             )
for(p in packages){
  if(!require(p, character.only = T)){
  install.packages(p)
  }
  library(p,character.only = T)}

2.2 Import relevant data

The data is stored in MC2 file. Read_csv()is to import four csv files into R. “Windows-1252” is a single-byte character encoding of the Latin alphabet for value “Katerina’s Café”.

cc<- read_csv("MC2/cc_data.csv",locale = locale(encoding = "windows-1252"))
lyt<-read_csv("MC2/loyalty_data.csv",locale = locale(encoding = "windows-1252"))
gps <-read_csv("MC2/gps.csv")
car<-read_csv("MC2/car-assignments.csv")

2.3 Data preparation

1. Change data type. The code chunk below is to convert id in gps from number to factor, last4ccnum in cc from number to character and timestamp in gps, loyalty, and credit card from character data type to Datetime format.

gps$id<-as.factor(gps$id)
car$CarID<-as.factor(car$CarID)
cc$last4ccnum<-as.character(cc$last4ccnum)
gps$Timestamp<-date_time_parse(gps$Timestamp,zone = "",format = "%m/%d/%Y %H:%M:%S")
lyt$timestamp<-date_time_parse(lyt$timestamp,zone = "",format = "%m/%d/%Y")
cc$timestamp<-date_time_parse(cc$timestamp,zone = "",format = "%m/%d/%Y %H:%M")

2. Combine data. The code chunk below is to combine first name and last name of employee together in the list of car.

car$Name<-paste(car$FirstName,car$LastName)

3. Add additional column. When analyzing the car assignment csv file, we may notice that the GAStech does not assign specific company truck for each truck driver, but this is not a case for other employees with title which is not truck driver. Hence, we can divide missing employees into two types, one is general employees, another one is truck driver, to further investigate their suspicious activities. Besides, add additional column for car type, and there are two types of cars, one for company truck another one for company car.

car$Category<-ifelse(car$CurrentEmploymentTitle == "Truck Driver", 
                       "Truck Driver","General Staff")
gps$car_type<-ifelse(gps$id %in% c('101','104','105','106','107'),'Company truck',
                     'company car')

The code chunk below is to extract day of the timestamp and add it in column called “day” in each file.

cc$day<-get_day(cc$timestamp)
lyt$day<-get_day(lyt$timestamp)
gps$day<-get_day(gps$Timestamp)

The code chunk below is the first part of vast challenge.

cc4<-cc
cc4$timestamp<-format(cc$timestamp,format="%Y-%m-%d")
lyt4<-lyt
lyt4$timestamp<-format(lyt4$timestamp,format="%Y-%m-%d")
cc_lyt<-merge(x=cc4,y=lyt4,all.x = TRUE)
x <-if_else(is.na(cc_lyt$loyaltynum),"No","Yes")
cc_lyt$use_loyalty<-x

suspicious_cc<- c("9551", "8332", "3484","5407","8156")
sus_day <- c(12,13,19)
cc_lyt<-cc_lyt%>%
  filter(day %in% sus_day, last4ccnum %in% suspicious_cc, 
         location == "Kronos Mart")
DT::datatable(cc_lyt)
#Import raster file
bgmap <- raster("MC2/Geospatial/MC2-tourist.tif")
# import the Abila shapefile
Abila_layout <- "MC2/Geospatial/Abila.shp"
Abila_shape <- readOGR(Abila_layout)
OGR data source with driver: ESRI Shapefile 
Source: "E:\LinliZhong\DataViz_blog\_posts\2021-07-25-mc2-part-2\MC2\Geospatial\Abila.shp", layer: "Abila"
with 3290 features
It has 9 fields
Integer64 fields read as strings:  TLID 
#Plot raster layer
tm_shape(bgmap) +
tm_rgb(bgmap, r = 1,g = 2,b = 3,alpha = NA,
       saturation = 1,interpolate = TRUE, max.value = 255)+
  tm_shape(Abila_shape)+
  tm_lines(col= "grey", lwd=1)
#Import GIS data file
Abila_st <- st_read(dsn = "MC2/Geospatial",
                    layer = "Abila")
Reading layer `Abila' from data source 
  `E:\LinliZhong\DataViz_blog\_posts\2021-07-25-mc2-part-2\MC2\Geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 3290 features and 9 fields
Geometry type: LINESTRING
Dimension:     XY
Bounding box:  xmin: 24.82401 ymin: 36.04502 xmax: 24.90997 ymax: 36.09492
Geodetic CRS:  WGS 84
#Converting aspatial data into a simple feature data frame
gps_sf <- st_as_sf(gps, coords = c("long", "lat"),crs= 4326)
#Creating movement path from GPS points
gps_path <- gps_sf %>%
  group_by(id, day) %>%
  summarize(m = mean(Timestamp), do_union=FALSE) %>%
  st_cast("LINESTRING")
gps_track <- read_csv("MC2/gps.csv")
gps_track$id<-as.factor(gps$id)
gps_track$Timestamp<-date_time_parse(gps_track$Timestamp,zone = "",
                                     format = "%m/%d/%Y %H:%M:%S")
gps4<- gps_track
gps4<- dplyr::rename(gps4, arrival_time = Timestamp)
gps4$date<-as.Date(format(gps4$arrival_time, "%Y-%m-%d"))
gps4$day_of_week<-wday(gps4$arrival_time, label=TRUE)
gps4 <- gps4[order(gps4$id,gps4$date,gps4$arrival_time), ]
gps4 <- gps4 %>%
  group_by(id) %>%
  mutate(departure_time = lead(arrival_time))%>%
  mutate(time_difference = departure_time - arrival_time)%>%
  filter(time_difference >= 18000)
gps4$arrival_time<-format(as.POSIXct(gps4$arrival_time), format = "%H:%M:%S")
gps4$departure_time<-format(as.POSIXct(gps4$departure_time), format = "%H:%M:%S")

gps5<-gps4%>%
  relocate(id,date,day_of_week,arrival_time,departure_time,lat,long,time_difference)

gps_sf <- st_as_sf(gps5, coords = c("long", "lat"),crs= 4326)
#Creating movement path from GPS points
gps_path <- gps_sf %>%
  group_by(id, date,day_of_week,time_difference,arrival_time,departure_time)
gps_path_selected <- gps_path%>%
  filter(date == "2014-01-07")
tmap_mode("view")
tm_shape(bgmap) +
  tm_rgb(bgmap, r = 1,g = 2,b = 3,alpha = NA,saturation = 1,
       interpolate = TRUE,max.value = 255) +
  tm_shape(Abila_shape)+
  tm_lines(col= "grey", lwd=1)+
  tm_shape(gps_path_selected)+
  tm_dots()

3. MC2 Visualization Preparation

3.4 Question 4

Given the data sources provided, identify potential informal or unofficial relationships among GASTech personnel. Provide evidence for these relationships. Please limit your response to 8 images and 500 words.

gps_track <- read_csv("MC2/gps.csv")
gps_track$id<-as.factor(gps$id)
gps_track$Timestamp<-date_time_parse(gps_track$Timestamp,zone = "",
                                     format = "%m/%d/%Y %H:%M:%S")

gps_sf <- st_as_sf(gps5, coords = c("long", "lat"),crs= 4326)
#Creating movement path from GPS points
gps_path <- gps_sf %>%
  group_by(id, date,day_of_week,time_difference,arrival_time,departure_time)
gps_path_selected <- gps_path# %>%
  #filter(date == "2014-01-07")
tmap_mode("view")
tm_shape(bgmap) +
  tm_rgb(bgmap, r = 1,g = 2,b = 3,alpha = NA,saturation = 1,
       interpolate = TRUE,max.value = 255) +
  tm_shape(Abila_shape)+
  tm_lines(col= "grey", lwd=1)+
  tm_shape(gps_path_selected)+
  tm_dots()

3.5 Question 5

Do you see evidence of suspicious activity? Identify 1- 10 locations where you believe the suspicious activity is occurring, and why Please limit your response to 10 images and 500 words.

3.4 Question 4

Given the data sources provided, identify potential informal or unofficial relationships among GASTech personnel. Provide evidence for these relationships. Please limit your response to 8 images and 500 words.

tmap_mode("view")
tm_shape(bgmap) +
  tm_rgb(bgmap, r = 1,g = 2,b = 3,alpha = NA,saturation = 1,
       interpolate = TRUE,max.value = 255) +
  tm_shape(Abila_shape)+
  tm_lines(col= "grey", lwd=1)+
  tm_shape(gps_path_selected)+
  tm_dots()

3.5 Question 5

Do you see evidence of suspicious activity? Identify 1- 10 locations where you believe the suspicious activity is occurring, and why Please limit your response to 10 images and 500 words.

p4<-plot_ly(data = cc, x = ~day, y=~price, marker = list(color = 'rgb(158,202,225)',
                           line = list(color = 'rgb(8,48,107)', width = 1.5)))
p4 <- p4 %>% layout(title = "Total price per day by credit card",
                                      xaxis = list(title = "Day"),
                                      yaxis = list(title = "Total price"))

p5<-plot_ly(data = lyt, x = ~day, y=~price, marker = list(color = 'rgb(158,202,225)',
                           line = list(color = 'rgb(8,48,107)', width = 1.5)))
p5<- p5 %>% layout(title = "Total price per day by loyalty card",
                                      xaxis = list(title = "Day"),
                                      yaxis = list(title = "Total price"))
p4
p5
p6<-ggplot(cc, aes(x = reorder(location,price), y=price)) + 
    geom_boxplot(color = "deepskyblue4")+
    theme(axis.text.x = element_text(angle = 90, hjust = 1))+
    labs(title = "Single Transaction Price per Location",
             x = "Location", y = "Price")
ggplotly(p6)

4. MC2 Visualization Insights

4.4 Question 4

Given the data sources provided, identify potential informal or unofficial relationships among GASTech personnel. Provide evidence for these relationships. Please limit your response to 8 images and 500 words.

Because of limited time and efforts, I choose “2014-01-07” as an example to show the unofficial relationships among GAStech personnel. In this file we can find out who is in clusters in the same place then infer they may have unoffical relationships with each other.

tmap_mode("view")
tm_shape(bgmap) +
  tm_rgb(bgmap, r = 1,g = 2,b = 3,alpha = NA,saturation = 1,
       interpolate = TRUE,max.value = 255) +
  tm_shape(Abila_shape)+
  tm_lines(col= "grey", lwd=1)+
  tm_shape(gps_path_selected)+
  tm_dots()

4.5 Question 5

Do you see evidence of suspicious activity? Identify 1- 10 locations where you believe the suspicious activity is occurring, and why Please limit your response to 10 images and 500 words.

Fig5.1 Suspicious Night Transactions by Credit Card

Suspicious activity 1. I filter out the transaction records made at midnight. There are five transactions on 12th, 13th, and 19th. What I am curious about is all the credit card transactions at midnight are in the Kronos Mart. Does this store have special merchandise for sale? Or are these missing employees and suspects meeting at this place at midnight?

Fig 5.2 Suspicious Transaction Money

Suspicious Activity 2. An analysis of the transactions per employee reveals that Lucas Alcazar spent a large sum of money ($10,000) at Frydo’s Autosupply n’ More. What products were bought in this transaction?

Fig 5.3 Suspicious Night Transactions

Suspicious activity 3. Here I filter out the night transactions made at Kronos Mart. All these five transactions didn’t use loyalty card. Were they afraid to record their peronsal information in their loyalty card?

Fig 5.4 Place Employees Stayed More Than 5 Hours in the past 14 days

Fig 5.6 Example of Clusters

Fig 5.5 shows the locations where customers stayed more than 5 hours in a day. Fig 5.6 is the example of the home of the owner of car 10. In the Fig 5.5, if there are clusters with several dots in one place, I assume it is the home of the car owner.

This is not the end of the assignment. To view the solution for question 4 and 5, please click here(https://linlizhong.netlify.app/posts/2021-07-25-vast-challenge-mc2/)